Deep learning models of ultrasonography significantly improved the differential diagnosis performance for superficial soft-tissue masses: a retrospective multicenter study

Background: Most superficial soft-tissue masses are benign tumors, and very few are malignant. However, persistent growth of both benign and malignant tumors can be painful and even life-threatening. Deep learning models may improve the differential diagnosis of superficial soft-tissue masses. This study aimed to propose a new ultrasonic deep learning model (DLM) system for the differential diagnosis of superficial soft-tissue masses.
Methods: Between January 2015 and December 2022, data for 1615 patients with superficial soft-tissue masses were retrospectively collected. Two experienced radiologists (radiologists 1 and 2, with 8 and 30 years' experience, respectively) analyzed the ultrasound images of each superficial soft-tissue mass and made a diagnosis of malignant mass or one of the five most common benign masses. After referring to the DLM results, they re-evaluated the diagnoses. The diagnostic performance and areas of concern of the radiologists were analyzed before and after referring to the DLM results.
Results: DLM-1, trained to distinguish between benign and malignant masses, achieved an AUC of 0.992 (95% CI: 0.980, 1.0) and an ACC of 0.987 (95% CI: 0.968, 1.0) in the validation cohort. DLM-2, trained to classify the five most common benign masses (lipomyoma, hemangioma, neurinoma, epidermal cyst, and calcifying epithelioma), achieved AUCs of 0.986, 0.993, 0.944, 0.973, and 0.903, respectively. In addition, with DLM-assisted diagnosis, the radiologists greatly improved their accuracy in differentiating benign from malignant tumors.
Conclusions: The proposed DLM system has high clinical application value in the differential diagnosis of superficial soft-tissue masses.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12916-023-03099-9.


Background
Superficial soft-tissue masses are benign and malignant masses that occur in the superficial skin layer, the subcutaneous tissue layer (fat, fibrous connective tissue, and blood vessels), and the muscle tissue layer [1]. They present on palpation as subcutaneous masses of different sizes and may be accompanied by pain, swelling, and dysfunction [2]. The annual incidence of superficial soft-tissue masses is about 3‰, and the incidence has increased in recent years [3]. Most are benign tumors, and very few (less than 1%) are malignant [4]. However, persistent growth of both benign and malignant masses can cause pain and discomfort. Malignant masses that continue to develop may cause complications, such as pathological fractures, and may spread and become life-threatening. In clinical practice, the main difficulty in diagnosing soft-tissue masses lies not in separating benign from malignant masses but in classifying benign masses: there are more than 70 benign subtypes, individual masses rarely display the typical imaging signs of their subtype, and diagnostic accuracy depends strongly on the radiologist's experience, so the accuracy of most radiologists is below 70%. Therefore, early detection and correct diagnosis are of great significance for the appropriate treatment and prognosis of superficial soft-tissue masses.
CT, MRI, and ultrasound can all be used to examine superficial soft-tissue masses. Among them, CT [5] localizes lesions well and can show tumor size, location, and boundary and the relationship with surrounding tissues. However, its soft-tissue resolution is low, qualitative diagnosis is sometimes difficult, and it involves ionizing radiation. Although MRI [6] can clarify the soft-tissue structure, it is expensive and requires a long scanning time [7], both of which hinder its use for diagnosing superficial soft-tissue masses in clinical practice. In contrast, ultrasound has good soft-tissue resolution, is non-invasive, safe, radiation-free, and inexpensive, can be repeated multiple times, and allows palpation during the examination, an advantage that other imaging methods cannot match, so it is the best method for the initial diagnosis of superficial soft-tissue masses [8][9][10][11][12][13]. However, in clinical practice, the diagnosis of superficial soft-tissue masses depends mainly on the experience and ability of the radiologist, which is subjective. Therefore, an automated tool that can provide screening and auxiliary diagnosis of superficial soft-tissue masses is needed to improve the diagnostic efficiency and accuracy of radiologists.
Unlike traditional methods, deep learning radiomics (DLR) is an emerging data-driven technology that can mine from medical images a large number of quantitative, high-throughput features that are difficult for the human eye to recognize and use them for diagnosis and prognosis [14,15]. In ultrasonic images, the lesion edge is fuzzy and strongly operator-dependent, so manually defining and extracting features is difficult and unreliable [15]. DLR extracts medical image features automatically by using a deep neural network structure, so its most significant advantage is that features do not need to be extracted manually [14,15].
There have been many studies on deep learning radiomics based on ultrasonic images [15][16][17][18][19][20][21]. All these studies obtained satisfactory results, indicating that deep learning models are conducive to more efficient ultrasonic diagnosis. However, as far as we know, only one study [22] has applied artificial intelligence (AI) to ultrasound images to distinguish and identify superficial soft-tissue masses. That study was conducted on 419 patients in a single center and was limited by a small dataset, a simple model, and poor performance in identifying benign subtypes. Therefore, more comprehensive studies with larger data cohorts are warranted to explore the differentiating performance of ultrasound-based DLR for superficial soft-tissue masses.
In this study, we retrospectively collected 1615 cases of superficial soft-tissue masses and aimed to propose a new ultrasound deep learning model system consisting of two deep learning models (DLM-1 and DLM-2) for the classification and diagnosis of superficial soft-tissue masses. DLM-1 is trained to distinguish between benign and malignant masses, and DLM-2 is trained to classify the five most common benign superficial soft-tissue masses: lipomyoma, hemangioma, neurinoma, epidermal cyst, and calcifying epithelioma. Furthermore, we collected data on superficial soft-tissue masses from two other hospitals as external test cohorts to validate the performance of the model. In addition, to further verify the clinical application value of the model, we compared the DLM with the radiologists.

Patients
In this study, we retrospectively collected data for a total of 1615 patients with superficial soft-tissue masses from Peking University Third Hospital and two other hospitals from January 2015 to December 2022. This study was approved by the Institutional Ethics Committee (approval number: S2022674), and the need to obtain informed consent from patients was waived.
All eligible cases were required to have pathological biopsy (histopathological) results, which served as the objective ground truth for the type of mass.
In this study, 20% of patients were randomly selected for the independent validation cohort, resulting in a 4:1 ratio of training to validation patients. Stratified random sampling was used so that the training and validation cohorts were completely isolated from each other, with a consistent proportion of patients responding and not responding.
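The 4:1 stratified split described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `stratified_split`, the fixed seed, and the toy labels are assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split case indices into training and validation sets, class by class.

    Each class is shuffled and divided separately, so both sets keep
    roughly the same class proportions and share no cases.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical toy labels (not study data): 90 benign and 10 malignant cases.
labels = ["benign"] * 90 + ["malignant"] * 10
train_idx, val_idx = stratified_split(labels)
```

Splitting within each class, rather than over the pooled cases, is what keeps the rare class represented in both cohorts.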

Acquisition and analysis of US findings
For each patient, we collected 5-8 frames of grayscale and CDFI images, screened them for suitable images, and finally used one grayscale image and one CDFI image for training and evaluation of the DLM. Most images were acquired with a 7-14 MHz linear array probe on a HITACHI AIRETT 70 or GE LOGIQ E9 system under the default instrument settings. Comparative and dynamic scans were performed during image acquisition to observe the boundary and extent of the mass. If the mass was deep or large, appropriate pressure was applied. We ensured that each case contained at least one grayscale image and one CDFI image.

Deep learning diagnostic and scoring models
A deep learning model system based on ultrasound images, including two deep learning models (DLM-1 and DLM-2), was developed for the differential diagnosis of superficial soft-tissue masses (Fig. 1). DLM-1 was used to distinguish between benign and malignant masses. DLM-2 consisted of five sub-models (SM-1, SM-2, SM-3, SM-4, and SM-5), used to identify lipomyoma, hemangioma, neurinoma, epidermal cyst, and calcifying epithelioma, respectively, among benign masses. All of these models are similar in structure to ConvNeXt networks [23,24] (see Additional file 1: Figure S1 and Method S1); the only difference is the fully connected layer, which we modified slightly so that the network can adapt to the current classification problems (see Additional file 1: Table S1). The input to each model is a grayscale ultrasound image or a CDFI image. The output of each model is the probability of each category, from 0 to 1.
Fig. 1 The structure of the deep learning model system. For each test case, our model takes ultrasound images as inputs and outputs the predictive probabilities for the diagnostic tasks on superficial soft-tissue masses, together with the corresponding heatmaps, to compare with and assist radiologists
When applying these deep learning models, the region of interest (ROI) of each ultrasonic image is first extracted manually to avoid interference from unnecessary text and graphics. Then, the ROI frame is resized to 470 × 280 based on the average size. During training, a series of data augmentation operations, including random scaling, random cropping, random flipping, and normalization, are applied to reduce overfitting. During testing, we directly resized each image to 470 × 280 and normalized it in the same way as during training. First, DLM-1 is applied to diagnose whether the sample is benign or malignant. If the sample is determined to be benign, further diagnosis is made by DLM-2, and the sub-model with the highest score gives the diagnosis.
The training cohort (n = 618) was used to train the model, the validation cohort (n = 154) was used to select the training hyperparameters and the best model during training, and test cohort A (n = 156) and test cohort B (n = 123) were used to test the generalization performance of the model. It is important to note that in this study the test cohorts were completely isolated from training and model selection, so they could be treated as separate data cohorts. We used network weights pretrained on the ImageNet dataset [25][26][27] as the initial weights. The ultrasonic images were then used to fine-tune the network weights. We used the same strategy to train SM-1, SM-2, SM-3, SM-4, SM-5, and DLM-1 (see Additional file 1: Method S2).
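The two-stage decision rule described above (DLM-1 first, then the highest-scoring DLM-2 sub-model for benign cases) can be sketched as follows. This is an illustrative sketch only: the function name `cascade_diagnosis`, the 0.5 malignancy threshold, and the example scores are assumptions, since the text does not state the operating threshold.

```python
# Benign subtypes handled by the five DLM-2 sub-models (SM-1 to SM-5).
BENIGN_SUBTYPES = [
    "lipomyoma",
    "hemangioma",
    "neurinoma",
    "epidermal cyst",
    "calcifying epithelioma",
]


def cascade_diagnosis(p_malignant, submodel_scores, threshold=0.5):
    """Combine the DLM-1 and DLM-2 outputs into a single diagnosis.

    p_malignant: DLM-1 probability that the mass is malignant (0 to 1).
    submodel_scores: mapping from benign subtype to its sub-model score.
    threshold: assumed cut-off for calling a mass malignant.
    """
    if p_malignant >= threshold:
        return "malignant"
    # Benign case: the sub-model with the highest score gives the diagnosis.
    return max(BENIGN_SUBTYPES, key=lambda name: submodel_scores[name])


# Hypothetical scores for one benign-looking case (not study data).
example_scores = {
    "lipomyoma": 0.10,
    "hemangioma": 0.82,
    "neurinoma": 0.31,
    "epidermal cyst": 0.05,
    "calcifying epithelioma": 0.12,
}
benign_result = cascade_diagnosis(0.08, example_scores)
malignant_result = cascade_diagnosis(0.93, example_scores)
```

Because each sub-model is a separate one-vs-rest classifier, their scores need not sum to 1; taking the maximum simply picks the most confident sub-model.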

Radiologist study
We compared the results of the DLM for the identification of benign and malignant masses and the classification of benign masses with the diagnoses of two radiologists of different seniority (Radiologist-1 and Radiologist-2, with 8 and 30 years of clinical experience, respectively). In the reader study, each radiologist evaluated grayscale and Doppler images of 58 patients in an internal test cohort, without access to clinical history or patient demographics, and recorded an image-only diagnosis. The performance and clinical application value of the DLM were assessed from the comparison results.

Statistical analysis
Accuracy, specificity, sensitivity, positive predictive value (PPV), negative predictive value (NPV), and F1-score were calculated to show the diagnostic performance of the DLM (see Additional file 1: Method S3). The χ² test for independence was used to calculate P values for categorical variables (gender and mass type), and one-way ANOVA was used to calculate P values for quantitative variables (age). The area under the receiver operating characteristic (ROC) curve (AUC) was used to estimate the performance of the DLM. For all tests mentioned above, a P value of < 0.05 was considered significant. The statistical analyses were performed using Python and SciPy.
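For readers unfamiliar with these metrics, the sketch below shows how they follow from a binary confusion matrix and how the AUC can be computed as the probability that a positive case scores higher than a negative one. It uses the standard definitions rather than the authors' implementation (Method S3 is not reproduced here); the function names, the zero-denominator convention, and the toy data are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Assumed convention: report 0.0 when a denominator is empty.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc_score(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not study data): 3 positive and 5 negative cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.35, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = binary_metrics(y_true, y_pred)
auc = auc_score(y_true, scores)
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for cohorts of this size; library routines use a rank-based formulation that gives the same value.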

Baseline characteristics
In this study, we retrospectively collected data for a total of 1615 patients with superficial soft-tissue masses, of which 564 patients were excluded according to the exclusion criteria: (a) 214 cases, (b) 39 cases, (c) 57 cases, (d) 209 cases, (e) 45 cases. Finally, a total of 1051 cases were included for model training and verification (Fig. 2).
The data cohort of the Third Hospital of Beijing Medical University (Hospital-1, n = 772) was randomly divided into the training cohort (n = 618) and the validation cohort (n = 154). Data from Beijing Civil Aviation General Hospital (Hospital-2, n = 156) formed test cohort A, and data from Beijing Friendship Hospital Affiliated to Capital Medical University (Hospital-3, n = 123) formed test cohort B. Baseline characteristics of these patients are summarized in Table 1. There were some statistically significant differences among the cohorts for benign masses, which may be attributable to the large sample size and the different hospitals.
Without knowing the confirmed results of the cases, the two radiologists re-diagnosed the previous 58 cases with the assistance of the DLM: their accuracies in differentiating benign from malignant masses were 89.7% (52/58) and 87.9% (51/58), respectively, and their accuracies in classifying benign masses were 80.9% (34/42) and 73.8% (31/42), respectively. ROC curves of the DLM in the validation cohort, compared with the two radiologists, are shown in Fig. 3b-f.

Interpretability of the DLM
To explore the interpretability of the DLM, we used gradient-weighted class activation mapping (Grad-CAM) [28] to visualize the areas of most concern to the DLM, as shown in Fig. 4.
We randomly selected 130 patients from the internal dataset, used Grad-CAM to display the areas of most concern to the DLM system, and compared them with the areas of most concern to the two radiologists of different seniority. For Radiologist-1, the two areas of concern coincided completely in 28.4% (37/130) of cases, mostly overlapped in 56.2% (73/130), overlapped in a small part in 13.1% (17/130), and did not overlap at all in 2.3% (3/130). For Radiologist-2, the two areas of concern coincided completely in 23.1% (30/130) of cases, mostly overlapped in 52.3% (68/130), overlapped in a small part in 17.7% (23/130), and did not overlap at all in 6.9% (9/130). Table 4 summarizes the level of coincidence between the areas of concern of the DLM and those of the two radiologists.
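The core Grad-CAM computation can be sketched in a few lines: each feature map of the last convolutional layer is weighted by the spatial average of its gradient with respect to the class score, the weighted maps are summed, and negative values are clipped. The sketch below follows the published Grad-CAM definition on plain nested lists; it is not the authors' pipeline, and the function name and toy feature maps are assumptions. In practice the activations and gradients would come from the trained network.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM heatmap.

    activations: list of K feature maps, each an H x W nested list.
    gradients: gradients of the class score with respect to each feature map,
               in the same layout as activations.
    """
    n_rows = len(activations[0])
    n_cols = len(activations[0][0])

    # Channel weights: global average of each gradient map.
    weights = [
        sum(sum(row) for row in grad) / (n_rows * n_cols) for grad in gradients
    ]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]

    # Scale to [0, 1] so the map can be rendered as a heatmap overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Hypothetical 2 x 2 feature maps with two channels (not study data).
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In this toy case the second channel has a negative weight, so locations where only it is active are suppressed, and the strongest response falls where the positively weighted first channel is largest.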

Discussion
This study evaluated the performance of a DLM system in the differential diagnosis of superficial soft-tissue masses, in particular its value to radiologists with different levels of experience. The DLM-assisted diagnosis was significantly helpful for the two radiologists.
DLM-1 and DLM-2 are two deep learning diagnostic models. DLM-1 was trained to distinguish between benign and malignant masses, and Table 2 shows that it performed excellently. In the validation cohort, the AUC of DLM-1 reached 0.992 (95% CI: 0.980, 1.0), and the ACC was 0.987 (95% CI: 0.968, 1.0), which suggests that the model was more accurate than the clinician in distinguishing benign from malignant masses. DLM-2 was trained to classify the five most common benign masses (lipomyoma, hemangioma, neurinoma, epidermal cyst, and calcifying epithelioma), and the AUCs in the validation cohort were 0.986, 0.993, 0.944, 0.973, and 0.903, respectively. In test cohort B, the DLM performed slightly worse because the ultrasonic images were taken on machines of a different make and model from those used in the other two centers. As these data show, all the performance indexes of DLM-2 were about 0.9, indicating that DLM-2 had a strong ability to classify the five kinds of benign soft-tissue masses. The combination of the two models can accurately diagnose soft-tissue masses. Deep learning is not subjective in the way humans are, so it can classify accurately and stably.
In the radiologist study, under DLM-assisted diagnosis, the radiologists' diagnostic accuracy improved greatly in both benign-malignant differentiation and benign classification, especially in benign classification. The exception was calcifying epithelioma, for which the improvement was limited: because the radiologists' diagnostic accuracy was already high, DLM assistance brought no significant improvement. Also, with the help of the DLM, junior radiologists can achieve the diagnostic accuracy of senior radiologists. Thus, the DLM has clinical application value in assisting radiologists in the diagnosis of soft-tissue masses.
We used Grad-CAM to visualize the DLM. When comparing the areas of most concern identified by the DLM with those identified by the radiologists, we found many common areas of concern, which explains why the proportion of complete or mostly overlapping areas between the two was more than 75%. For example, (1) for malignant masses [29], both focused on the rich blood flow inside the lesion (Fig. 4a); (2) for lipomyoma [30], both focused on the strong echo lines inside the lesion (Fig. 4b); (3) for hemangioma, both paid much attention to the obvious internal honeycomb structure and the enhanced echo behind the lesion (Fig. 4c); (4) for neurinoma [31,32], both focused on the "bright cap sign" of the lesions (Fig. 4d); (5) for epidermal cyst [33,34], both focused on the enhancement of the echo behind the lesion (Fig. 4e); and (6) for calcifying epithelioma [35,36], both focused on the obvious attenuation of the echo behind the lesion (Fig. 4f).
In addition, the two differed in many areas of concern. For example, (1) for malignant masses, the radiologists focused on the sharp but irregular edges of the lesion, while the DLM focused on the hyperechoic wrapping of unequal thickness around the lesion, which represents a large number of small interfaces after infiltration and to which the radiologists did not pay sufficient attention (Fig. 4a); (2) for lipomyoma, when there were few thick lines, the DLM paid more attention to the thick lines, and when there were many thin lines, the DLM paid more attention to two thin lines that were very close together. More lines and fine lines indicate that there are many normal fascia lines in the lesion, meaning it is more likely to be benign, whereas malignant masses contain fewer fascia lines; this is not generally noticed by ultrasound physicians (Fig. 4b); (3) for neurinoma, the DLM paid more attention to blood flow signals inside the lesions, indicating solid nodules (Fig. 4d); (4) for epidermal cyst, the DLM focused on the beginning of the lateral sound shadow, which means that the site is smooth and is not easy for the radiologist to see at a glance (Fig. 4e).
We found many similarities and differences between the areas of concern of the DLM and the signs used by the radiologists. The similarities further support the rationality and feasibility of the model and can help doctors quickly locate the key part of the lesion area. The differences point clinicians to additional lesion areas worth attention and provide new ideas for clinical diagnosis. These differences may arise from our image labeling: we did not cover and sketch the boundary details of the lesion as traditional labeling does but cropped a wide field of view, which gave the model more room for self-discovery and learning. Compared with a traditional model, which sees only the details that the doctor wants it to see, our method may enable the model to discover details that the doctor did not find.
Currently, the only relevant work is an artificial intelligence model proposed by Benjamin Wang et al. [22] to distinguish soft-tissue masses. Their model does a good job of distinguishing benign from malignant masses. However, their work has many limitations. First, the number of cases they collected was small (n = 419), and many cases lacked two-dimensional and color Doppler ultrasound images. Second, they had no external test cohort and the model was not verified on data from other hospitals, so the reported model performance was not convincing. Third, although their model did a good job of distinguishing benign from malignant masses, it failed completely to identify the three benign masses, and the study's conclusion did not even mention benign differentiation. Finally, the artificial intelligence model applied in that study is simple in structure and low in efficiency, with low value for clinical application. In contrast, we propose and validate a DLM system that addresses these deficiencies well and achieves excellent performance, establishing a more effective and clinically applicable model for the differentiation of soft-tissue masses.
The main limitation of the study relates to the design of the reader study with two specialist radiologists. In the reader study, the radiologists could only interpret selected static two-dimensional grayscale and CDFI images. In practice, radiologists can combine patient history, clinical symptoms, and real-time dynamic image information to reach a diagnosis. The reader study design did not take this into account, which may have underestimated the performance of the radiologists. Another limitation is that, because malignant cases were few and highly varied, it was not possible to further distinguish among malignant cases. In the future, we will use more collected clinical data to classify malignant masses, which may further improve the diagnostic performance of the DLM system.

Conclusions
In summary, we propose a new ultrasound deep learning model system, including two deep learning models (DLM-1 and DLM-2), with good performance for the classification and diagnosis of superficial soft-tissue masses. If this model is applied clinically, it may help to improve the accuracy of classification and identification of soft-tissue masses by the radiologist. Furthermore, it is helpful for improving the diagnostic efficiency of soft-tissue masses in the physical examination screening scenario and has high clinical application value.
Inclusion criteria were as follows: (a) diagnosis confirmed by histopathological findings of puncture biopsy or surgical excision; (b) images free from puncture needles and other external foreign bodies; (c) both two-dimensional grayscale images and color Doppler flow imaging (CDFI) images available; (d) clear images with typical features. Exclusion criteria were as follows: (a) no histopathological findings; (b) interference by puncture needles and other external foreign bodies; (c) only one of the two-dimensional grayscale image and the CDFI image available; (d) unqualified ultrasonic images; (e) benign soft-tissue masses other than the five benign masses studied here.

Fig. 3 ROC curves of DLM-1 and DLM-2. ROC curves of the DLM in the training cohort, validation cohort, test cohort A, and test cohort B. a ROC curves of DLM-1. b ROC curves of lipomyoma in DLM-2. c ROC curves of hemangioma in DLM-2. d ROC curves of neurinoma in DLM-2. e ROC curves of epidermal cyst in DLM-2. f ROC curves of calcifying epithelioma in DLM-2. ROC, receiver operating characteristic curve; AUC, area under the receiver operating characteristic curve; DLM, deep learning model

Fig. 4 Visualization of the DLM using Grad-CAM. CDFI and activation maps of 6 types of lumps are shown. The strong response areas (red areas) are the areas the DLM paid more attention to, which means that these areas are more valuable for response prediction. The ovals represent the common areas of concern of the radiologist and the DLM; the squares represent the areas of greater concern of the radiologist; the triangles represent the areas of greater concern of the DLM. CDFI, color Doppler flow imaging; DLM, deep learning model

Table 1
Patient and tumor baseline characteristics. Data are presented as n (%) or mean ± SD

Table 2
Diagnostic performance of DLM-1. Data in brackets are the 95% confidence interval. Abbreviations: AUC area under the receiver operating characteristic curve, ACC accuracy, PPV positive predictive value, NPV negative predictive value, DLM deep learning model, training cohort (n = 617 individuals), validation cohort (n = 155 individuals), test cohort A (n = 156 individuals), test cohort B (n = 122 individuals)

Table 3
Diagnostic performance of DLM-2. Data in brackets are the 95% confidence interval. Abbreviations: AUC area under the receiver operating characteristic curve, ACC accuracy, PPV positive predictive value, NPV negative predictive value, DLM deep learning model, training cohort (n = 584 individuals), validation cohort (n = 148 individuals), test cohort A (n = 151 individuals), test cohort B (n = 112 individuals)